Python Web Scraping for Image Data Cleaning and Storage: A Step-by-Step Tutorial with a Stable Diffusion Example

Python web scraping tutorial for image datasets, using a Stable Diffusion gallery as the example. Learn Requests + Parsel extraction, downloads, WebP to JPEG conversion with Pillow, resizing, compression, denoising, and normalization with OpenCV. Store images via MongoDB GridFS or file paths, and integrate Amazon S3/CDN. Includes step-by-step code, quality checks, and storage best practices.

2025-11-03

Python web scraping for image data cleaning and storage is a practical workflow when you need large-scale images for training and fine-tuning models. In many projects, you must not only scrape image URLs, but also download, convert formats (like WebP), standardize size, normalize pixels, enhance quality, and finally store images reliably for later use.

In this tutorial, we use a Stable Diffusion gallery page as an example to demonstrate an end-to-end pipeline: scrape → download → clean/process → store.


What you will build

By the end, you will have:

  1. A scraper that extracts image src URLs with Requests + Parsel (XPath)
  2. A download step that saves images locally, with optional retries
  3. A cleaning step that converts WebP to JPEG, resizes, compresses, and normalizes pixel values
  4. Optional quality enhancement (denoising and sharpening) with OpenCV
  5. A storage layer built on MongoDB GridFS, or on file paths + metadata with object storage and a CDN


Step 1: Find the image download URL (src)

Open the Stable Diffusion gallery in your browser and press F12 to open Developer Tools. Then, inspect the image element and locate the src attribute. That src is the download URL you want to extract.

Because page markup can change over time, always confirm the current DOM structure (tag names and classes) in Developer Tools before writing your XPath.


Step 2: Parse image src URLs with XPath (Parsel + Requests)

Below is a minimal example to request the page HTML and parse image URLs using XPath.

from parsel import Selector
import requests

url = "https://stabledifffusion.com/gallery"

headers = {
  'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
  'cookie': '_ga=GA1.1.258999226.1754806446; _ga_C4QP4FPRFF=GS2.1.s1754806445$o1$g1$t1754807302$j44$l0$h0',
  'pragma': 'no-cache',
  'referer': 'https://stabledifffusion.com/',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36'
}

def get_stable_diffusion_images():
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail fast on HTTP errors
    selector = Selector(text=response.text)
    image_urls = selector.xpath('//div[@class="grid grid-cols-1 md:grid-cols-3 gap-4"]/div[@class="max-w-sm"]/img/@src').getall()
    return image_urls

Example output:

[
  "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-1.webp",
  "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-2.webp",
  "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-3.webp",
  "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-4.webp",
  "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-5.webp",
  "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-6.webp"
]

Step 3: Download images locally

Next, write a download function. For reliability, keep it simple first, then add retries if your project needs it.

import requests

def download_image(image_url, filename):
    response = requests.get(image_url, timeout=30)  # timeout avoids hanging forever
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"Image {filename} downloaded successfully.")
    else:
        print(f"Failed to download image {filename}. Status code: {response.status_code}")

Step 4: Convert WebP to JPEG with Pillow

Stable Diffusion gallery images are often in WebP format. Fortunately, Pillow can read WebP directly in most environments.

from PIL import Image

# open WebP format image
with Image.open("sd-generate-1.webp") as img:
    # display image info
    print(f"format: {img.format}")
    print(f"size: {img.size}")
    # JPEG has no alpha channel, so convert to RGB before saving
    img.convert("RGB").save("image.jpg", "JPEG")

This step ensures compatibility with tools that prefer JPG/PNG.
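
To convert a whole folder at once, a small loop does the job. This sketch assumes a downloads/ directory of .webp files and writes JPEGs to converted/ (both directory names are illustrative):

from pathlib import Path
from PIL import Image

Path("converted").mkdir(exist_ok=True)
for webp_path in Path("downloads").glob("*.webp"):
    with Image.open(webp_path) as img:
        # JPEG has no alpha channel, so convert to RGB first
        img.convert("RGB").save(f"converted/{webp_path.stem}.jpg", "JPEG", quality=90)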


Step 5: Resize and compress to reduce storage

Resizing is common because it saves storage and improves training throughput. Additionally, high-quality downsampling like LANCZOS reduces aliasing artifacts.

from PIL import Image

# Open WebP image
with Image.open("image.webp") as img:
    print(f"Original format: {img.format}")
    print(f"Original size: {img.size}")  # (width, height)

    # halve both dimensions
    new_width = img.size[0] // 2
    new_height = img.size[1] // 2
    new_size = (new_width, new_height)

    # LANCZOS is a high-quality downsampling filter
    resized_img = img.resize(new_size, Image.Resampling.LANCZOS)

    print(f"New size: {resized_img.size}")

    # save as JPEG; lowering quality compresses further (85 is a common trade-off)
    resized_img.convert("RGB").save("resized_image.jpg", "JPEG", quality=85)

Resize to fixed input size (YOLO-style)

For YOLO training, images usually must be fixed-size (for example 640×640). Therefore, size normalization is required.

resized_img = img.resize((640, 640), Image.Resampling.LANCZOS)
resized_img.convert("RGB").save("resized_image.jpg", "JPEG")
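
Note that a direct resize to 640×640 distorts the aspect ratio. If you want to preserve it, a common alternative is letterboxing: scale the longer side to the target and pad the remainder with neutral gray. A minimal sketch (the 114-gray fill follows a common YOLO convention):

from PIL import Image

def letterbox(img, target=640, fill=(114, 114, 114)):
    # scale so the longer side fits the target while preserving aspect ratio
    ratio = target / max(img.size)
    new_size = (round(img.size[0] * ratio), round(img.size[1] * ratio))
    resized = img.resize(new_size, Image.Resampling.LANCZOS)
    # paste the resized image centered on a square canvas padded with gray
    canvas = Image.new("RGB", (target, target), fill)
    offset = ((target - new_size[0]) // 2, (target - new_size[1]) // 2)
    canvas.paste(resized, offset)
    return canvas

with Image.open("image.webp") as img:
    letterbox(img.convert("RGB")).save("letterboxed_image.jpg", "JPEG")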

Step 6: Normalize pixel values (4 common methods)

Besides resizing, pixel normalization is another standard step for YOLO and general computer vision pipelines.

Supported methods:

  1. '0-1': scale pixels to [0, 1]
  2. '-0.5-0.5': scale to [-0.5, 0.5] (zero-centered)
  3. 'z-score': standardize to zero mean and unit variance
  4. 'uint8': convert back to 0-255 integers (denormalization)

import cv2
import numpy as np
from PIL import Image

def normalize_pixel_values(image, method='0-1'):
    """
    Image pixel value normalization function

    Parameters:
        image: Input image, can be a PIL Image or NumPy array
        method: Normalization method
                '0-1': Normalize to [0, 1] range
                '-0.5-0.5': Normalize to [-0.5, 0.5] range
                'z-score': Z-score standardization
                'uint8': Convert to 0-255 integers (denormalization)
    Returns:
        Normalized image
    """
    if isinstance(image, Image.Image):
        image = np.array(image)

    normalized = image.copy().astype(np.float32)

    if method == '0-1':
        if normalized.max() > 0:
            normalized = normalized / 255.0

    elif method == '-0.5-0.5':
        normalized = (normalized / 255.0) - 0.5

    elif method == 'z-score':
        mean = np.mean(normalized)
        std = np.std(normalized)
        if std > 0:
            normalized = (normalized - mean) / std
        else:
            normalized = normalized - mean

    elif method == 'uint8':
        normalized = np.clip(normalized, 0, 255).astype(np.uint8)

    else:
        raise ValueError(f"Unsupported normalization method: {method}")

    return normalized

if __name__ == "__main__":
    image_path = "input_image.jpg"

    cv_image = cv2.imread(image_path)
    cv_image_rgb = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)

    pil_image = Image.open(image_path)

    methods = ['0-1', '-0.5-0.5', 'z-score']
    for method in methods:
        normalized_cv = normalize_pixel_values(cv_image_rgb, method)
        print(f"Method: {method}, OpenCV image - Pixel range: [{normalized_cv.min():.4f}, {normalized_cv.max():.4f}]")

        normalized_pil = normalize_pixel_values(pil_image, method)
        print(f"Method: {method}, PIL image - Pixel range: [{normalized_pil.min():.4f}, {normalized_pil.max():.4f}]")

    normalized = normalize_pixel_values(cv_image_rgb, '0-1')
    denormalized = normalize_pixel_values(normalized, 'uint8')
    print(f"Denormalization - Pixel range: [{denormalized.min()}, {denormalized.max()}], Data type: {denormalized.dtype}")

    cv2.imwrite("denormalized_image.jpg", cv2.cvtColor(denormalized, cv2.COLOR_RGB2BGR))

Step 7: Image quality optimization (denoise and sharpen)

Sometimes scraped images are noisy, low-contrast, or slightly blurry. In that case, you can enhance feature clarity before training.

Common processing options:

  1. Denoising: Gaussian, median, or bilateral filtering, or non-local means
  2. Sharpening: unsharp masking to boost edge contrast (see the sketch after the denoising example)
  3. Contrast enhancement: histogram equalization or CLAHE

A basic OpenCV example for denoising:

import cv2
import numpy as np
from matplotlib import pyplot as plt

def denoise_image(image_path, method='non_local_means'):
    """
    Image denoising processing

    :param image_path: Input image path
    :param method: Denoising method
                   'gaussian': Gaussian filtering
                   'median': Median filtering
                   'bilateral': Bilateral filtering
                   'non_local_means': Non-local means denoising
    :return: Denoised image
    """
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError("Unable to read image, please check the path")

    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    if method == 'gaussian':
        denoised = cv2.GaussianBlur(img, (5, 5), 0)
    elif method == 'median':
        denoised = cv2.medianBlur(img, 5)
    elif method == 'bilateral':
        denoised = cv2.bilateralFilter(img, 9, 75, 75)
    elif method == 'non_local_means':
        denoised = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
    else:
        raise ValueError(f"Unsupported denoising method: {method}")

    denoised_rgb = cv2.cvtColor(denoised, cv2.COLOR_BGR2RGB)

    plt.figure(figsize=(10, 5))
    plt.subplot(121), plt.imshow(img_rgb), plt.title('Original Image')
    plt.subplot(122), plt.imshow(denoised_rgb), plt.title(f'Denoised ({method})')
    plt.show()

    return denoised

if __name__ == "__main__":
    image_path = "lena.jpg"
    denoised = denoise_image(image_path, method='non_local_means')
    cv2.imwrite("denoised_lena.jpg", denoised)

Run:

python image_process.py
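
The script above covers denoising; for sharpening, a simple option is unsharp masking: blend the image against a blurred copy with a negative weight so edges are boosted. A minimal sketch (the strength and sigma values are illustrative):

import cv2

def sharpen_image(image_path, strength=1.5, sigma=3):
    img = cv2.imread(image_path)
    # unsharp mask: sharpened = (1 + strength) * img - strength * blurred
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)
    sharpened = cv2.addWeighted(img, 1.0 + strength, blurred, -strength, 0)
    return sharpened

if __name__ == "__main__":
    cv2.imwrite("sharpened_lena.jpg", sharpen_image("lena.jpg"))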

Step 8: Image storage strategies

For scalable storage, you can use Amazon S3 object storage or build your own storage engine. MongoDB is also common, with two typical approaches:

  1. Store binary image data (suited for small images; large files use GridFS)
  2. Store image file path + metadata (recommended at scale and for high concurrency)
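
If you choose S3, boto3 handles uploads in a few lines. Below is a minimal sketch; the bucket name and key layout are illustrative, and credentials are assumed to come from your environment or AWS config:

import boto3

s3 = boto3.client("s3")

def upload_image_to_s3(local_path, bucket, key):
    # upload_file streams the file and handles multipart uploads automatically
    s3.upload_file(local_path, bucket, key, ExtraArgs={"ContentType": "image/jpeg"})
    return f"https://{bucket}.s3.amazonaws.com/{key}"

# url = upload_image_to_s3("resized_image.jpg", "my-image-bucket", "sd/resized_image.jpg")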

Method 1: Store image binary data in MongoDB (GridFS)

For files larger than 16MB (the BSON document size limit), MongoDB recommends GridFS. GridFS splits a file into chunks (255 kB by default), which works well for large images and videos.

import os

from pymongo import MongoClient
from gridfs import GridFS

class MongoDBImageStorage:
    def __init__(self, db_name="image_database"):
        self.client = MongoClient('mongodb://localhost:27017/')
        self.db = self.client[db_name]
        self.fs = GridFS(self.db)

    def store_image(self, image_path, metadata=None):
        try:
            with open(image_path, 'rb') as f:
                image_data = f.read()

            filename = os.path.basename(image_path)

            # store custom fields under "metadata" so get_image_metadata can read them back
            file_id = self.fs.put(
                image_data,
                filename=filename,
                content_type=f'image/{filename.split(".")[-1]}',
                metadata=metadata or {}
            )

            print(f"Image stored successfully. File ID: {file_id}")
            return file_id

        except Exception as e:
            print(f"Error storing image: {str(e)}")
            return None

    def retrieve_image(self, file_id, output_path):
        try:
            file = self.fs.get(file_id)
            image_data = file.read()

            with open(output_path, 'wb') as f:
                f.write(image_data)

            print(f"Image retrieved successfully. Saved to: {output_path}")
            return True

        except Exception as e:
            print(f"Error retrieving image: {str(e)}")
            return False

    def get_image_metadata(self, file_id):
        try:
            file = self.fs.get(file_id)
            return {
                "filename": file.filename,
                "content_type": file.content_type,
                "upload_date": file.upload_date,
                "length": file.length,
                "metadata": file.metadata
            }
        except Exception as e:
            print(f"Error getting metadata: {str(e)}")
            return None

    def delete_image(self, file_id):
        try:
            self.fs.delete(file_id)
            print(f"Image with ID {file_id} deleted successfully")
            return True
        except Exception as e:
            print(f"Error deleting image: {str(e)}")
            return False

if __name__ == "__main__":
    storage = MongoDBImageStorage()
    metadata = {"category": "nature", "resolution": "1920x1080"}
    file_id = storage.store_image("test_image.jpg", metadata)

    if file_id:
        print("Image metadata:", storage.get_image_metadata(file_id))
        storage.retrieve_image(file_id, "retrieved_image.jpg")
        # storage.delete_image(file_id)

Method 2: Store image paths + metadata (recommended at scale)

For large images or high-concurrency usage, it is more efficient to store images in a filesystem (local disk, NAS, or cloud storage) and only save the path/URL and metadata in MongoDB.

In practice, this approach simplifies CDN delivery, reduces database load, and improves read performance. Moreover, it is easier to scale as a distributed crawler system.

You can then expose image access through an API built with FastAPI or Express; add a domain name and you have the beginnings of an S3-like storage service.
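
As a sketch of the path-plus-metadata approach, the FastAPI app below looks up a document in an images collection and streams the file from disk. The collection layout (image_id, path, and content_type fields) is an assumption for illustration:

from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse
from pymongo import MongoClient

app = FastAPI()
images = MongoClient("mongodb://localhost:27017/")["image_database"]["images"]

@app.get("/images/{image_id}")
def get_image(image_id: str):
    # fetch the stored path + metadata, then serve the file itself from disk
    doc = images.find_one({"image_id": image_id})
    if doc is None or not Path(doc["path"]).exists():
        raise HTTPException(status_code=404, detail="Image not found")
    return FileResponse(doc["path"], media_type=doc.get("content_type", "image/jpeg"))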


Practical checklist for production pipelines

To keep your pipeline stable:

  1. Set request timeouts and add retries with backoff (see Step 3)
  2. Rate-limit requests and send realistic headers to stay polite to the target site
  3. Deduplicate images, for example by hashing file contents
  4. Validate every downloaded file before processing, as in the sketch below
  5. Standardize format, size, and pixel range before storage (Steps 4-6)
  6. At scale, keep binaries in object storage and paths + metadata in the database (Step 8)
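
For the validation step, Pillow's verify() is a quick integrity check that catches truncated or corrupt downloads. A minimal sketch:

from PIL import Image

def is_valid_image(path):
    # verify() checks file integrity without decoding full pixel data
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        return False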


FAQ

What is python web scraping for image data cleaning and storage?

It is a pipeline that extracts image URLs from web pages, downloads images, converts and processes them (resize, normalize, enhance), and stores them in a database or object storage for later model training.

Why convert WebP images to JPEG?

Some training or annotation tools prefer JPEG/PNG. Converting also improves compatibility across environments.

Should I store image binaries in MongoDB?

You can, especially with GridFS for large files. However, at scale, storing paths (and using object storage) is usually more efficient.

What image size should I use for YOLO training?

Common sizes include 640×640 or 416×416. The correct choice depends on your model and dataset, but fixed-size input is a standard requirement.



Conclusion

Python web scraping for image data cleaning and storage becomes much easier when you treat it as a structured pipeline: scrape URLs, download images, clean and normalize them, enhance quality when necessary, and store them with an approach that scales. With MongoDB GridFS or a path-based strategy plus cloud storage, your crawler can grow into a distributed system that centralizes image data for training and fine-tuning.